[ML] Improve hyperparameter tuning performance #1941
Conversation
… to avoid runtime blowup
Good work on improving the performance by integrating the function smoother into the minimizer! I have a couple of comments, mostly regarding readability.
BOOST_REQUIRE_CLOSE_ABSOLUTE(
    0.0, bias, 4.0 * std::sqrt(noiseVariance / static_cast<double>(trainRows)));
// Good R^2...
BOOST_TEST_REQUIRE(rSquared > 0.98);
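The assertion above bounds the empirical bias by four standard errors of the mean. A minimal sketch of that check, outside the Boost.Test framework (the function name is illustrative, not from the PR):

```cpp
#include <cmath>
#include <vector>

// Returns true if |mean(samples)| <= 4 * sqrt(noiseVariance / n), i.e. the
// empirical bias is within four standard errors of zero.
bool biasWithinBound(const std::vector<double>& samples, double noiseVariance) {
    double mean = 0.0;
    for (double s : samples) {
        mean += s;
    }
    mean /= static_cast<double>(samples.size());
    double bound = 4.0 * std::sqrt(noiseVariance / static_cast<double>(samples.size()));
    return std::fabs(mean) <= bound;
}
```

For zero-mean noise with the stated variance, the sample mean exceeds this bound with probability well under 0.01%, so the test is tight but stable.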
nice!
// Test minimization of some training loss curves from boosted tree hyperparameter
// line searches for:
//   1. Miniboone
//   2. Car-parts
//   3. Boston
good work! 🚀
Thanks for the review @valeriy42! I think I've addressed everything. I also disabled writing out extra stats and model metadata for the time being. This requires changes to the Java code as well and I'll make those together.
Good work and thank you for this explanation of handling mean-variance. I have just a couple of minor comments: take it or leave it. LGTM 🚀
TSizeVecVec testingMasks;
this->setupMasks(numberFolds, trainingMasks, testingMasks);

TDoubleVec K(17);
I would have preferred to call m_K m_SmoothingParameter. Due to our coding standards, it has to be a capital K, although it relates to the small k in the formulas.
// So what are we doing here? When we supply function values we also supply their
// error variance. Typically these might be the mean test loss function across
// folds and their variance for a particular choice of hyperparameters. Sticking
// with this example, the variance allows us to estimate the error w.r.t. the
// true generalisation error due to finite sample size. We can think of the source
// of this variance as being due to two effects: one which shifts the loss values
// in each fold (this might be due to some folds simply having more hard examples)
// and another which permutes the order of loss values. A shift in the loss function
// is not something we wish to capture in the GP: it shouldn't materially affect
// where to choose points to test since any sensible optimisation strategy should
// only care about the difference in loss between points, which is unaffected by a
// shift. More formally, if we assume the shift and permutation errors are independent
// we have for losses l_i, mean loss per fold m_i and mean loss for a given set of
// hyperparameters m that the variance is
//
//   sum_i{ (l_i - m)^2 } = sum_i{ (l_i - m_i + m_i - m)^2 }
//                        = sum_i{ (l_i - m_i)^2 } + sum_i{ (m_i - m)^2 }
//                        = "permutation variance" + "shift variance"        (1)
//
// with the cross-term expected to be small by independence. (Note, the independence
// assumption is reasonable if one assumes that the shift is due to mismatch in hard
// examples since we choose folds independently at random.) We can estimate the
// shift variance by looking at the mean loss over all distinct hyperparameter settings
// and we assume it is supplied as the parameter m_ExplainedErrorVariance. It should
// also be smaller than the variance by construction although for numerical stability
// we prevent the difference becoming too small. As discussed, here we wish to return
// the permutation variance, which we get by rearranging (1).
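The decomposition in (1) is the standard within/between (ANOVA) split and can be checked numerically. A self-contained sketch (not the PR's implementation; struct and member names are illustrative, loosely following the codebase's naming style):

```cpp
#include <cstddef>
#include <vector>

// Holds the three terms of equation (1): s_Total = s_Permutation + s_Shift
// exactly, because each fold mean m_i zeroes the cross-term within its fold.
struct SVarianceDecomposition {
    double s_Total = 0.0;       // sum_i (l_i - m)^2
    double s_Permutation = 0.0; // sum_i (l_i - m_i)^2
    double s_Shift = 0.0;       // sum over folds of n_fold * (m_i - m)^2
};

SVarianceDecomposition decompose(const std::vector<std::vector<double>>& foldLosses) {
    // Overall mean loss m across every fold.
    double m = 0.0;
    std::size_t n = 0;
    for (const auto& fold : foldLosses) {
        for (double l : fold) {
            m += l;
            ++n;
        }
    }
    m /= static_cast<double>(n);

    SVarianceDecomposition result;
    for (const auto& fold : foldLosses) {
        // Per-fold mean loss m_i.
        double mi = 0.0;
        for (double l : fold) {
            mi += l;
        }
        mi /= static_cast<double>(fold.size());
        for (double l : fold) {
            result.s_Total += (l - m) * (l - m);
            result.s_Permutation += (l - mi) * (l - mi);
        }
        result.s_Shift += static_cast<double>(fold.size()) * (mi - m) * (mi - m);
    }
    return result;
}
```

Rearranging as in the comment, the permutation variance the code wants is simply the total variance minus the shift (explained) variance.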
Nice 👍 Thank you for the explanation! 📖
…1960) This makes two changes to deal better with small data sets highlighted by a failure in our QA suite as a result of #1941. In particular:

1. We could miss out rare classes altogether from our validation set for small data sets.
2. We can lose a lot of accuracy by over restricting the number of features we use for small data sets.

Problem 1 is a result of the stratified sampling we perform. If a class is rare and the data set is small we could choose never to sample it in the validation set because it could constitute fewer than one example per fold. In this case, the fraction of each class changes significantly in the remaining unsampled set for each fold we sample, but we compute the desired class counts once upfront based on their overall frequency. We simply need to recompute the desired counts per class based on the frequencies in the remainder in the loop which samples each new fold.

Problem 2 requires that we allow ourselves to use more features than are implied by our default constraint of having n examples per feature for small data sets. Since we automatically remove nuisance features based on their MICe with the target, we typically don't suffer a loss in QoR from allowing ourselves to select extra features. Furthermore, for small data sets runtime is never problematic. For the multi-class classification problem which showed up this problem, accuracy increases from around 0.2 to 0.9 as a result of this change.
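The fix for problem 1 amounts to recomputing desired per-class counts from the *remaining* unsampled examples before each fold is drawn, rather than once from overall frequencies. A hypothetical sketch of that idea (function name, rounding rule, and signature are illustrative, not the actual implementation):

```cpp
#include <cstddef>
#include <vector>

// remaining[c] = unsampled examples of class c. Returns how many examples of
// each class to place in the next fold, given foldsLeft folds still to fill.
// Rounding to nearest means a rare class with fewer than one example per fold
// is still sampled once its share of the shrinking remainder grows, instead of
// being rounded down to zero in every fold.
std::vector<std::size_t> desiredCountsForNextFold(const std::vector<std::size_t>& remaining,
                                                  std::size_t foldsLeft) {
    std::vector<std::size_t> desired(remaining.size());
    for (std::size_t c = 0; c < remaining.size(); ++c) {
        desired[c] = (remaining[c] + foldsLeft / 2) / foldsLeft;
    }
    return desired;
}
```

For example, a class with a single example and two folds left gets a desired count of one, whereas a one-shot upfront computation over many folds would assign it zero everywhere.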
Our default hyperparameter tuning becomes extremely expensive in runtime on large data sets. Currently, our best solution is to suggest using a low train fraction, but this means final train (which is comparatively a couple of orders of magnitude faster) also sees less data. This can be addressed by using a smaller proportion than 1 - 1 / # folds of the data for train during hyperparameter tuning. This provides us with both an escape hatch to avoid pathological run time in the case someone runs against a very large data set and also the means of allowing the user to specify fast training while sacrificing a small amount of accuracy. I tried this out on a few different data sets and the following result is typical:

* This is the default behaviour.

The performance gain is not proportional to the fraction of data used because:
(Note, compared to downsampling while training this has a couple of advantages if runtime is a priority: it guarantees worst case performance and it improves caching since the same rows are always selected rather than a different random sample for each tree.)

This also introduces a hard limit on the maximum number of rows we will use for hyperparameter tuning and enforces this by selecting a lower train fraction when this would be exceeded.
A second improvement relates to initial hyperparameter search. I observed that
The second problem is slightly tricky. For small data sets, losses at different hyperparameter settings are typically noisy and our current strategy of fitting a parabola through the points works well. For large data sets the loss curve is smooth, but often non-parabolic over the range we explore. Fitting a LOWESS regression to the loss curve instead performs pretty well for finding the "true" minimum (also better than interpolation by a GP, which I tried). The issue with different amounts of noise is handled by choosing the amount of smoothing by maximum likelihood. This change generally gives us a small performance bump and, importantly, means we can often skip fine tuning altogether. As such, I now allow max_optimization_rounds_per_hyperparameter to be set to zero.

One last issue was that when we characterised loss variance across the folds we included variance in the mean loss. Particularly for small data sets this could be due to sampling effects: one test fold contains examples which are harder to predict. Over multiple rounds we can estimate this effectively and remove this component of the variance when we fit the GP.